
Mova #1337

Open

mi804 wants to merge 7 commits into modelscope:main from mi804:mova

Conversation


mi804 (Collaborator) commented on Mar 9, 2026

No description provided.

gemini-code-assist (bot) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Multimodal Open-source Video-Audio (MOVA) framework, a significant enhancement to the DiffSynth system. It enables the generation of coherent and synchronized video and audio content from textual and/or image prompts. The core innovation lies in the integration of dedicated audio generation models (DiT and VAE) and a sophisticated dual-tower conditional bridge that facilitates seamless interaction and conditioning between the visual and audio components throughout the diffusion process. This allows for the creation of rich, multimodal outputs, expanding the capabilities of the platform beyond video-only generation.

Highlights

  • Introduction of MOVA Models: Integrated new audio-specific Diffusion Transformer (DiT) and Variational Autoencoder (VAE) models, along with a Dual-Tower Conditional Bridge for multimodal generation.
  • Cross-Modal Interaction: Implemented a novel DualTowerConditionalBridge to facilitate bidirectional conditioning between visual and audio hidden states, enabling synchronized video and audio generation (a toy sketch follows this list).
  • Unified Audio-Video Pipeline: Introduced a comprehensive MovaAudioVideoPipeline that orchestrates the entire generation process, from prompt embedding and noise initialization to denoising and final audio/video decoding.
  • VRAM Management Integration: Added VRAM management configurations for the new MOVA models, ensuring efficient memory usage during inference and training.
  • Distributed Acceleration Support: Enhanced the framework with Unified Sequence Parallelism (USP) for distributed processing, including new utility functions for chunking and gathering tensors.
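
For readers new to the pattern, here is a minimal, illustrative Python sketch of bidirectional cross-modal conditioning between two towers. All class and parameter names below are hypothetical; this is not the PR's actual DualTowerConditionalBridge API.

import torch
import torch.nn as nn

class ToyDualTowerBridge(nn.Module):
    """Toy bridge: each tower attends to the other and receives a residual update."""

    def __init__(self, dim_video: int, dim_audio: int, dim_bridge: int, num_heads: int = 8):
        super().__init__()
        # Project each tower's hidden states into a shared bridge dimension
        # (dim_bridge must be divisible by num_heads).
        self.video_in = nn.Linear(dim_video, dim_bridge)
        self.audio_in = nn.Linear(dim_audio, dim_bridge)
        # One cross-attention per direction: video attends to audio, and vice versa.
        self.video_attends_audio = nn.MultiheadAttention(dim_bridge, num_heads, batch_first=True)
        self.audio_attends_video = nn.MultiheadAttention(dim_bridge, num_heads, batch_first=True)
        self.video_out = nn.Linear(dim_bridge, dim_video)
        self.audio_out = nn.Linear(dim_bridge, dim_audio)

    def forward(self, video_hidden: torch.Tensor, audio_hidden: torch.Tensor):
        v = self.video_in(video_hidden)  # [B, T_video, dim_bridge]
        a = self.audio_in(audio_hidden)  # [B, T_audio, dim_bridge]
        v_cond, _ = self.video_attends_audio(v, a, a)  # video queries, audio keys/values
        a_cond, _ = self.audio_attends_video(a, v, v)  # audio queries, video keys/values
        # Residual updates keep each tower's own representation dominant.
        return (video_hidden + self.video_out(v_cond),
                audio_hidden + self.audio_out(a_cond))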


Changelog
  • diffsynth/configs/model_configs.py
    • Added configurations for MOVA audio DiT, audio VAE, and dual-tower bridge models.
  • diffsynth/configs/vram_management_module_maps.py
    • Added VRAM management mappings for the new MOVA models.
  • diffsynth/models/mova_audio_dit.py
    • Implemented the MovaAudioDit model, an audio-specific Diffusion Transformer.
  • diffsynth/models/mova_audio_vae.py
    • Implemented the DacVAE (Digital Audio Codec VAE) model for audio encoding and decoding.
  • diffsynth/models/mova_dual_tower_bridge.py
    • Implemented the DualTowerConditionalBridge for cross-modal interaction between video and audio models.
  • diffsynth/models/wan_video_dit.py
    • Updated RMSNorm to optionally use PyTorch's native F.rms_norm and added a helper function set_to_torch_norm.
  • diffsynth/pipelines/mova_audio_video.py
    • Added the MovaAudioVideoPipeline for multimodal audio-video generation.
  • diffsynth/utils/data/media_io_mova.py
    • Added utility functions for saving video with synchronized audio.
  • diffsynth/utils/xfuser/__init__.py
    • Exported new sequence parallelism utility functions.
  • diffsynth/utils/xfuser/xdit_context_parallel.py
    • Implemented get_current_chunk and gather_all_chunks for distributed tensor handling (see the sketch after this changelog).
  • examples/mova/acceleration/unified_sequence_parallel.py
    • Added an example script demonstrating MOVA pipeline with Unified Sequence Parallelism.
  • examples/mova/model_inference/MOVA-360p-TI2AV.py
    • Added an example script for 360p Text-to-Image-to-Audio-Video generation.
  • examples/mova/model_inference/MOVA-720p-TI2AV.py
    • Added an example script for 720p Text-to-Image-to-Audio-Video generation.
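
For intuition, below is a hypothetical sketch of what such sequence-parallel chunk/gather helpers typically look like. The signatures match the identity fallbacks quoted later in this review (get_current_chunk(x, dim=1), gather_all_chunks(x, seq_len, dim=1)); the bodies are assumptions, and the real implementations presumably use xfuser's sequence-parallel group rather than the default process group.

import torch
import torch.distributed as dist

def get_current_chunk(x: torch.Tensor, dim: int = 1) -> torch.Tensor:
    # Keep only this rank's slice along the sequence dimension. Assumes an
    # initialized process group and a length along `dim` divisible by world size.
    rank, world = dist.get_rank(), dist.get_world_size()
    return x.chunk(world, dim=dim)[rank]

def gather_all_chunks(x: torch.Tensor, seq_len: int, dim: int = 1) -> torch.Tensor:
    # Reassemble the full sequence from every rank, then trim padding to seq_len.
    world = dist.get_world_size()
    chunks = [torch.empty_like(x) for _ in range(world)]
    dist.all_gather(chunks, x.contiguous())
    return torch.cat(chunks, dim=dim).narrow(dim, 0, seq_len)
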
Activity
  • No human activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.




gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the MOVA model for audio-video generation, including new models, a pipeline, and examples. The changes are extensive and well-structured. I've identified a few issues, including a critical typo in a configuration file that would break VRAM management, a potential device mismatch bug in the audio VAE model, an incorrect model name in a pipeline unit, and an inefficient import in the pipeline's model function. Addressing these will improve the correctness and performance of the new feature.

Note: Security Review did not run due to the size of the PR.

Comment on lines +265 to +269
"diffsynth.models.mova_audo_vae.DacVAE": {
"diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},

critical

There's a typo in the module path for DacVAE. It's written as mova_audo_vae but it should be mova_audio_vae. This will cause a KeyError and prevent VRAM management from working for this model. The same typo exists for Snake1d inside the dictionary.

Suggested change
"diffsynth.models.mova_audo_vae.DacVAE": {
"diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},
"diffsynth.models.mova_audio_vae.DacVAE": {
"diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},

Comment on lines +317 to +318
if self.deterministic:
return torch.Tensor([0.0])

high

torch.Tensor([0.0]) creates a tensor on the default device, which might not match the device of the model's parameters. This can lead to device mismatch errors during execution. You should specify the device when creating the tensor to ensure it's on the correct device.

Suggested change
if self.deterministic:
return torch.Tensor([0.0])
if self.deterministic:
return torch.tensor([0.0], device=self.parameters.device)
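
For context: torch.tensor is the factory function and accepts dtype/device keyword arguments, while the legacy torch.Tensor constructor builds CPU float tensors from data and rejects other devices. This distinction also matters for the next suggestion below. A quick illustration:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([0.0], device=device)  # OK: the factory function honors `device`

# The legacy constructor raises for non-CPU devices, roughly:
#   torch.Tensor([0.0], device="cuda")
#   RuntimeError: legacy constructor expects device type: cpu ...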

Comment on lines +336 to +337
if self.deterministic:
return torch.Tensor([0.0])

high

Similar to the kl method, torch.Tensor([0.0]) here can cause a device mismatch. Please specify the device when creating the tensor to ensure it's on the correct device.

Suggested change
if self.deterministic:
return torch.Tensor([0.0])
if self.deterministic:
return torch.tensor([0.0], device=self.parameters.device)

super().__init__(
input_params=("input_audio", "audio_noise"),
output_params=("audio_latents", "audio_input_latents"),
onload_model_names=("audio_vae_encoder",)

medium

The model name audio_vae_encoder is used here, but based on the pipeline's from_pretrained method and model_configs.py, the correct model name for the audio VAE is mova_audio_vae. While this code path is currently not used due to a NotImplementedError, it should be corrected for future implementation.

Suggested change
onload_model_names=("audio_vae_encoder",)
onload_model_names=("mova_audio_vae",)

Comment on lines +403 to +407
if use_unified_sequence_parallel:
from ..utils.xfuser import get_current_chunk, gather_all_chunks
else:
get_current_chunk = lambda x, dim=1: x
gather_all_chunks = lambda x, seq_len, dim=1: x

medium

The from ..utils.xfuser import ... statement is inside the model_fn_mova_audio_video function. This will cause the import to be executed every time the function is called, which is inefficient, especially inside a loop. It's better to move the import to the top of the file and use a conditional assignment.

Suggested change
if use_unified_sequence_parallel:
from ..utils.xfuser import get_current_chunk, gather_all_chunks
else:
get_current_chunk = lambda x, dim=1: x
gather_all_chunks = lambda x, seq_len, dim=1: x
if not use_unified_sequence_parallel:
get_current_chunk = lambda x, dim=1: x
gather_all_chunks = lambda x, seq_len, dim=1: x
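
For reference, a minimal sketch of the top-of-file pattern this suggestion implies; the try/except guard is an added assumption, in case xfuser is an optional dependency. The in-function branch from the suggested change above then rebinds the identity fallbacks when USP is disabled.

# Top of diffsynth/pipelines/mova_audio_video.py (sketch, not the PR's exact code).
try:
    from ..utils.xfuser import get_current_chunk, gather_all_chunks
except ImportError:
    # Identity fallbacks with the same signatures as the USP helpers.
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x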

@mi804 mi804 self-assigned this Mar 12, 2026
@mi804 mi804 requested a review from Copilot March 12, 2026 10:22

Copilot AI left a comment


Pull request overview

This PR adds support for the openmoss/MOVA audio-video model family to DiffSynth-Studio, including a new MOVA audio-video pipeline, supporting model components, training scripts, inference/validation examples, and documentation entries.

Changes:

  • Add MovaAudioVideoPipeline plus MOVA model components (audio DiT, audio VAE, dual-tower bridge) and related VRAM management/model registry entries.
  • Introduce shared audio/video IO utilities (audio.py, audio_video.py) and migrate LTX2 examples to use the new audio reader.
  • Add MOVA training/inference shell/Python examples and document MOVA models in English/Chinese docs + READMEs.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 10 comments.

Show a summary per file
File | Description
pyproject.toml | Moves audio deps to an optional extra and adds torchcodec.
examples/mova/model_training/validate_lora/MOVA-720p-I2AV.py | MOVA LoRA validation example (720p).
examples/mova/model_training/validate_lora/MOVA-360p-I2AV.py | MOVA LoRA validation example (360p).
examples/mova/model_training/validate_full/MOVA-720p-I2AV.py | MOVA full-finetune validation example (720p).
examples/mova/model_training/validate_full/MOVA-360p-I2AV.py | MOVA full-finetune validation example (360p).
examples/mova/model_training/train.py | MOVA training entrypoint wiring dataset + training module.
examples/mova/model_training/lora/MOVA-720P-I2AV.sh | LoRA training command(s) for MOVA-720p.
examples/mova/model_training/lora/MOVA-360P-I2AV.sh | LoRA training command(s) for MOVA-360p.
examples/mova/model_training/full/MOVA-720P-I2AV.sh | Full training command(s) for MOVA-720p.
examples/mova/model_training/full/MOVA-360P-I2AV.sh | Full training command(s) for MOVA-360p.
examples/mova/model_inference_low_vram/MOVA-720p-I2AV.py | Low-VRAM MOVA inference example (720p).
examples/mova/model_inference_low_vram/MOVA-360p-I2AV.py | Low-VRAM MOVA inference example (360p).
examples/mova/model_inference/MOVA-720p-I2AV.py | Standard MOVA inference example (720p).
examples/mova/model_inference/MOVA-360p-I2AV.py | Standard MOVA inference example (360p).
examples/mova/acceleration/unified_sequence_parallel.py | MOVA USP inference example.
examples/ltx2/model_inference_low_vram/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio.
examples/ltx2/model_inference_low_vram/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio.
examples/ltx2/model_inference/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio.
examples/ltx2/model_inference/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio.
docs/zh/Model_Details/Wan.md | Adds MOVA entries to the model capability table (ZH).
docs/en/Model_Details/Wan.md | Adds MOVA entries to the model capability table (EN).
diffsynth/utils/xfuser/xdit_context_parallel.py | Adds chunk/gather helpers used by MOVA USP path.
diffsynth/utils/xfuser/__init__.py | Re-exports new USP chunk/gather helpers.
diffsynth/utils/data/media_io_ltx2.py | Re-exports write_video_audio_ltx2 from new shared writer.
diffsynth/utils/data/audio_video.py | New shared video+audio writer based on PyAV.
diffsynth/utils/data/audio.py | New shared audio utilities + torchcodec-based reader.
diffsynth/pipelines/mova_audio_video.py | New MOVA audio-video diffusion pipeline implementation.
diffsynth/pipelines/ltx2_audio_video.py | Ensures LTX2 inputs are stereo before mel conversion.
diffsynth/models/wan_video_dit.py | Adds an option to use F.rms_norm in custom RMSNorm.
diffsynth/models/mova_dual_tower_bridge.py | Adds dual-tower cross-modal bridge module for MOVA.
diffsynth/models/mova_audio_vae.py | Adds MOVA audio VAE implementation.
diffsynth/models/mova_audio_dit.py | Adds MOVA audio DiT implementation.
diffsynth/diffusion/base_pipeline.py | Removes stereo enforcement in output_audio_format_check.
diffsynth/configs/vram_management_module_maps.py | Adds VRAM management mappings for MOVA models.
diffsynth/configs/model_configs.py | Registers MOVA models in the global MODEL_CONFIGS list.
README_zh.md | Documents MOVA model support announcement + adds MOVA table entries (ZH).
README.md | Documents MOVA model support announcement + adds MOVA table entries (EN).


Comment on lines +1 to +4
import torch
import torchaudio
from torchcodec.decoders import AudioDecoder
from torchcodec.encoders import AudioEncoder

Copilot AI Mar 12, 2026


torchaudio and torchcodec are imported at module import time, but they were moved out of the base dependencies into the optional audio extra. This means a default install (without extras) will fail just by importing diffsynth.utils.data.audio (and transitively any pipeline that imports it). Use lazy/optional imports (inside functions) with a clear error message, or keep these packages in the core dependencies.

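A sketch of the lazy-import pattern being suggested, assuming the optional extra is named audio (the function name read_audio comes from this PR; the error wording and decoding calls are illustrative, based on torchcodec's AudioDecoder API):

def read_audio(path: str):
    # Import the optional audio dependency lazily so a base install can
    # still import diffsynth.utils.data.audio without torchcodec present.
    try:
        from torchcodec.decoders import AudioDecoder
    except ImportError as e:
        raise ImportError(
            "read_audio requires the optional audio dependencies; "
            "install them with: pip install 'diffsynth[audio]'"
        ) from e
    samples = AudioDecoder(path).get_all_samples()
    return samples.data, samples.sample_rate
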
Comment on lines +41 to +42
if samples.ndim == 1:
samples = samples[:, None]

Copilot AI Mar 12, 2026


_write_audio() reshapes 1D audio as samples[:, None] (shape [T, 1]), but the rest of the function expects channel-first [C, T]. This will fail the stereo conversion/assertion for 1D audio. Normalize 1D audio to [1, T] (and validate 2D inputs are [C, T]) before calling convert_to_stereo().

Suggested change
if samples.ndim == 1:
samples = samples[:, None]
# Normalize input shape to channel-first [C, T] before stereo conversion.
if samples.ndim == 1:
# [T] -> [1, T]
samples = samples.unsqueeze(0)
elif samples.ndim == 2:
# Expect [C, T] with C in {1, 2}.
if samples.shape[0] not in (1, 2):
raise ValueError(
f"audio samples with 2 dimensions must be [C, S] with C in {{1, 2}}, got shape {tuple(samples.shape)}"
)
else:
raise ValueError(
f"audio samples must be 1D [S] or 2D [C, S], got {samples.ndim}D with shape {tuple(samples.shape)}"
)

ValueError: If an audio tensor is provided but the sample rate cannot be determined.
"""
duration = len(video) / fps
if audio_sample_rate is None:

Copilot AI Mar 12, 2026


write_video_audio() infers audio_sample_rate using audio.shape[...] even when audio is None, which will raise at runtime for video-only outputs. Only infer audio_sample_rate when audio is not None (or require it explicitly in that case).

Suggested change
if audio_sample_rate is None:
if audio is not None and audio_sample_rate is None:

Comment on lines +255 to +264
def process(self, pipe: MovaAudioVideoPipeline, input_audio, audio_noise):
if input_audio is None or not pipe.scheduler.training:
return {"audio_latents": audio_noise}
else:
input_audio, sample_rate = input_audio
input_audio = convert_to_mono(input_audio)
input_audio = resample_waveform(input_audio, sample_rate, pipe.audio_vae.sample_rate)
input_audio = pipe.audio_vae.preprocess(input_audio.unsqueeze(0), pipe.audio_vae.sample_rate)
z, _, _, _, _ = pipe.audio_vae.encode(input_audio)
return {"audio_input_latents": z.mode()}

Copilot AI Mar 12, 2026


In the training path, MovaAudioVideoUnit_InputAudioEmbedder.process() uses pipe.audio_vae but never calls pipe.load_models_to_device(self.onload_model_names). Under VRAM management/offloading this can leave audio_vae offloaded and cause failures or unexpected device/dtype behavior. Load the model in the training branch like the video embedder does.

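A sketch of the suggested fix, using only names already present in the quoted code; the exact placement of the load call is the one assumption:

def process(self, pipe: MovaAudioVideoPipeline, input_audio, audio_noise):
    if input_audio is None or not pipe.scheduler.training:
        return {"audio_latents": audio_noise}
    else:
        # Onload the audio VAE before use, mirroring the video embedder's
        # pattern, so VRAM offloading cannot leave it on the wrong device.
        pipe.load_models_to_device(self.onload_model_names)
        input_audio, sample_rate = input_audio
        input_audio = convert_to_mono(input_audio)
        input_audio = resample_waveform(input_audio, sample_rate, pipe.audio_vae.sample_rate)
        input_audio = pipe.audio_vae.preprocess(input_audio.unsqueeze(0), pipe.audio_vae.sample_rate)
        z, _, _, _, _ = pipe.audio_vae.encode(input_audio)
        return {"audio_input_latents": z.mode()}
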
def __init__(self):
super().__init__(
input_params=("input_video", "video_noise", "tiled", "tile_size", "tile_stride"),
output_params=("latents", "input_latents"),

Copilot AI Mar 12, 2026


MovaAudioVideoUnit_InputVideoEmbedder declares output_params=("latents", "input_latents"), but the unit actually outputs video_latents and/or input_latents. output_params is used by the pipeline-unit dependency graph and training graph splitting; this mismatch can lead to incorrect unit splitting/caching behavior. Update output_params to match the real output keys.

Suggested change
output_params=("latents", "input_latents"),
output_params=("video_latents", "input_latents"),

Comment on lines +122 to +124
if self.use_torch_norm:
return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:

Copilot AI Mar 12, 2026


RMSNorm.forward() calls torch.nn.functional.rms_norm when use_torch_norm=True, but the project’s torch requirement is torch>=2.0.0 and F.rms_norm may not exist in older supported versions. Consider guarding with hasattr(F, "rms_norm") (fallback to the existing self.norm(...) path) or bump the minimum torch version accordingly.

Suggested change
if self.use_torch_norm:
return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:
if self.use_torch_norm and hasattr(F, "rms_norm"):
return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:

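An alternative sketch that resolves the capability check once at import time instead of on every forward call. Parameter names, defaults, and the fallback math are assumptions modeled on common RMSNorm implementations, not the PR's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# F.rms_norm is missing on older torch releases; check availability once.
_HAS_RMS_NORM = hasattr(F, "rms_norm")

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5, use_torch_norm: bool = False):
        super().__init__()
        self.eps = eps
        self.normalized_shape = (dim,)
        self.weight = nn.Parameter(torch.ones(dim))
        # Fold the availability check into the flag so forward() stays simple.
        self.use_torch_norm = use_torch_norm and _HAS_RMS_NORM

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_torch_norm:
            return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
        # Manual fallback: normalize by root-mean-square in float32, then scale.
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x.float() * rms).type_as(x) * self.weight
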
get_sequence_parallel_world_size,
get_sp_group)
from xfuser.core.long_ctx_attention import xFuserLongContextAttention
import torch.distributed as dist

Copilot AI Mar 12, 2026


torch.distributed as dist is imported but not used in this module. Remove the unused import to avoid lint warnings and keep dependencies clearer.

Suggested change
import torch.distributed as dist

Comment on lines +265 to +268
"diffsynth.models.mova_audo_vae.DacVAE": {
"diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",

Copilot AI Mar 12, 2026


The VRAM module map key uses diffsynth.models.mova_audo_vae..., but the new model is defined in diffsynth/models/mova_audio_vae.py. With this typo (audo vs audio), the VRAM auto-wrapping rules for DacVAE/Snake1d will never be applied. Rename the map keys to the correct fully-qualified module path so VRAM management works for MOVA audio VAE.

Suggested change
"diffsynth.models.mova_audo_vae.DacVAE": {
"diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"diffsynth.models.mova_audio_vae.DacVAE": {
"diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
"torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",

Comment on lines +18 to +24
Supports [C, T] or [B, C, T]. Duplicate mono, keep stereo.
"""
if audio_tensor.size(-2) == 1:
return audio_tensor.repeat(1, 2, 1) if audio_tensor.dim() == 3 else audio_tensor.repeat(2, 1)
return audio_tensor



Copilot AI Mar 12, 2026


convert_to_stereo() assumes audio_tensor has at least 2 dimensions (uses size(-2)), but other code paths/documentation allow 1D audio tensors [T]. Add explicit shape handling/validation (e.g., normalize to [C, T] first) so 1D inputs don’t raise an IndexError.

Suggested change
Supports [C, T] or [B, C, T]. Duplicate mono, keep stereo.
"""
if audio_tensor.size(-2) == 1:
return audio_tensor.repeat(1, 2, 1) if audio_tensor.dim() == 3 else audio_tensor.repeat(2, 1)
return audio_tensor
Supports [T], [C, T] or [B, C, T]. Duplicate mono, keep stereo.
"""
# Handle 1D input [T] by treating it as mono and returning [2, T].
if audio_tensor.dim() == 1:
return audio_tensor.unsqueeze(0).repeat(2, 1)
# Handle 2D input [C, T].
if audio_tensor.dim() == 2:
if audio_tensor.size(0) == 1:
# [1, T] -> [2, T]
return audio_tensor.repeat(2, 1)
return audio_tensor
# Handle 3D input [B, C, T].
if audio_tensor.dim() == 3:
if audio_tensor.size(1) == 1:
# [B, 1, T] -> [B, 2, T]
return audio_tensor.repeat(1, 2, 1)
return audio_tensor
raise ValueError(
f"Unsupported audio tensor shape {tuple(audio_tensor.shape)} for convert_to_stereo; "
"expected [T], [C, T], or [B, C, T]."
)


def enable_usp(self):
from ..utils.xfuser import get_sequence_parallel_world_size, usp_attn_forward
for block in self.video_dit.blocks + self.audio_dit.blocks + self.video_dit2.blocks:

Copilot AI Mar 12, 2026


enable_usp() unconditionally iterates self.video_dit2.blocks, but video_dit2 can be None when only one DiT checkpoint is provided (see from_pretrained()). Guard this (e.g., only include video_dit2 blocks when it exists) to avoid an AttributeError when use_usp=True with a single DiT.

Suggested change
for block in self.video_dit.blocks + self.audio_dit.blocks + self.video_dit2.blocks:
all_blocks = []
if self.video_dit is not None:
all_blocks += self.video_dit.blocks
if self.audio_dit is not None:
all_blocks += self.audio_dit.blocks
if self.video_dit2 is not None:
all_blocks += self.video_dit2.blocks
for block in all_blocks:
